Terrorism Data Analysis Project

Grey Kienzle

CMSC320

Table of Contents

  1. Introduction
  2. Data Collection and Processing
  3. Exploratory Data Analysis
  4. Hypothesis Testing
  5. Conclusion

Introduction

Terrorist attacks, defined by the UMD START Consortium as "the threatened or actual use of illegal force and violence by a nonstate actor to attain a political, economic, religious, or social goal through fear, coercion, or intimidation", are a tragic and difficult reality of the modern world. Lives, families, and societies are devastated by these events, and their prevalence is a stain on world relations. The goal of this project is to describe worldwide trends in terrorist activity between 1970 and 2021 in order to inform the global community about the scope and impact of such activity.

Throughout this tutorial, we will analyze trends in terrorist activity, including the relationships between attack types and their locations, between attack types and attack frequency, and between attack frequency and change in Gross Domestic Product (GDP).

The Python libraries used in this project include:

  • pandas, used to input and process tabular data
  • openpyxl, an input engine for spreadsheet files required by pandas for read_excel
  • numpy, used along with pandas to process large data
  • country_converter, a library for converting country names to ISO3 codes for easier data merging
  • logging, used to suppress messages from country_converter
  • matplotlib, used to generate and clean plots describing trends we find
  • geopandas, used to add a geometry to a pandas DataFrame in order to plot onto simple maps using longitude and latitude data
  • sklearn, used to make models of data

The less common packages we are using (openpyxl, country_converter, and geopandas) can be installed from within Jupyter:

In [1]:
%%capture
!pip3 install openpyxl country_converter geopandas

Data Collection and Processing

In the Data Collection phase, we collect data and "tidy" it to allow us to easily analyze it later.

Terrorist attack data was retrieved from the UMD START Consortium's Global Terrorism Database, which provides an .xlsx file containing detailed data on terrorist attacks that occurred between 1970 and 2021, including the date, time, and location of each event, the type(s) of attack, the estimated number of injuries and lives lost, and the estimated value of property damage.

GDP data was retrieved from the World Bank, which provides a .csv file containing GDP data for each country in each year.

In these first steps, we will import each dataset, modify the shape of each dataset in order to make it easier to process later, add or modify columns of each dataset, and combine the two datasets into one cumulative dataset.

We first import all necessary libraries:

In [2]:
import pandas as pd
import numpy as np
import country_converter as coco
import logging
import matplotlib.pyplot as plt
import geopandas
from sklearn import linear_model

And we will import the GDP dataset:

In [3]:
gdp_data = pd.read_csv("final_data/gdp.csv")
gdp_data.head()
Out[3]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
0 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.615084e+09 2.727933e+09 2.791061e+09 2.963128e+09 2.983799e+09 3.092179e+09 3.202235e+09 3.310056e+09 2.496648e+09 NaN
1 Africa Eastern and Southern AFE GDP (current US$) NY.GDP.MKTP.CD 2.129059e+10 2.180847e+10 2.370702e+10 2.821004e+10 2.611879e+10 2.968217e+10 ... 9.730435e+11 9.839370e+11 1.003679e+12 9.242525e+11 8.823551e+11 1.020647e+12 9.910223e+11 9.975340e+11 9.216459e+11 1.082096e+12
2 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09 ... 1.990732e+10 2.014640e+10 2.049713e+10 1.913421e+10 1.811656e+10 1.875347e+10 1.805323e+10 1.879945e+10 2.011614e+10 NaN
3 Africa Western and Central AFW GDP (current US$) NY.GDP.MKTP.CD 1.040414e+10 1.112789e+10 1.194319e+10 1.267633e+10 1.383837e+10 1.486223e+10 ... 7.275704e+11 8.207927e+11 8.649905e+11 7.607345e+11 6.905464e+11 6.837487e+11 7.416899e+11 7.945430e+11 7.844457e+11 8.358084e+11
4 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 1.249982e+11 1.334016e+11 1.372444e+11 8.721929e+10 4.984049e+10 6.897276e+10 7.779294e+10 6.930910e+10 5.361907e+10 7.254699e+10

5 rows × 66 columns

Wow, that's ugly! It seems like each year is a column in this dataset, which makes it more difficult for us to model anything related to time, as we do not have any Year column to regress over. Hence, we have to melt (same as pivot_longer in R) all of these numerical columns into a single Year column while maintaining all other columns:

In [4]:
gdp_data = gdp_data.melt(id_vars=["Country Name", "Country Code", "Indicator Name", "Indicator Code"], var_name="Year", value_name="GDP")
gdp_data.head()
Out[4]:
   Country Name                 Country Code  Indicator Name     Indicator Code  Year  GDP
0  Aruba                        ABW           GDP (current US$)  NY.GDP.MKTP.CD  1960  NaN
1  Africa Eastern and Southern  AFE           GDP (current US$)  NY.GDP.MKTP.CD  1960  2.129059e+10
2  Afghanistan                  AFG           GDP (current US$)  NY.GDP.MKTP.CD  1960  5.377778e+08
3  Africa Western and Central   AFW           GDP (current US$)  NY.GDP.MKTP.CD  1960  1.040414e+10
4  Angola                       AGO           GDP (current US$)  NY.GDP.MKTP.CD  1960  NaN

Much better! Each row is now a single observation of one country's GDP in one year.
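To make the reshaping step concrete, here is a minimal sketch of melt on a toy table. The country codes and values are made up for illustration and are not taken from the World Bank file:

```python
import pandas as pd

# Toy wide table: one row per country, one column per year (values are made up)
wide = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "1960": [1.0, 3.0],
    "1961": [2.0, 4.0],
})

# Every column not listed in id_vars is folded into a (Year, GDP) pair
long = wide.melt(id_vars=["Country Code"], var_name="Year", value_name="GDP")
print(long)
```

Each of the two year columns contributes one row per country, so the two-row wide table becomes a four-row long table.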

However, note that read_csv parsed the column headers as strings, so all of the years we melted into the Year column are still in string format. We can verify this by checking the data types of the columns using the dtypes attribute of the DataFrame:

In [5]:
gdp_data.dtypes
Out[5]:
Country Name       object
Country Code       object
Indicator Name     object
Indicator Code     object
Year               object
GDP               float64
dtype: object

Note that the Year column has the object data type (which pandas uses for strings) rather than an integer type, as we would have hoped. Thus, we have to convert each entry in this column to an integer, which we can do using the .apply() method of the column (a pandas Series) and the built-in int() function, which converts its input to an integer:

In [6]:
gdp_data["Year"] = gdp_data["Year"].apply(int)
gdp_data.dtypes
Out[6]:
Country Name       object
Country Code       object
Indicator Name     object
Indicator Code     object
Year                int64
GDP               float64
dtype: object

Now Year has type int64.
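As an aside, the same conversion can probably be written without a per-element function call. The sketch below, on a made-up Series, shows astype and the more forgiving pd.to_numeric; neither is used in the original notebook:

```python
import pandas as pd

years = pd.Series(["1960", "1961", "2021"], name="Year")

# Vectorized conversion; raises ValueError if any entry is not a valid integer
years_int = years.astype(int)

# More forgiving variant: entries that cannot be parsed become NaN instead of raising
years_num = pd.to_numeric(pd.Series(["1960", "n/a"]), errors="coerce")
```

The second form yields a float column when any value is missing, because NaN cannot be stored in a plain integer column.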

Finally, we add a gdp_change column holding the change in GDP from the current year to the next (next year's GDP minus the current year's, so a negative value indicates a drop). We will use it to see if there exists a link between high levels of terrorist activity in a given year and a subsequent drop in GDP over the next year:

In [7]:
gdp_data["gdp_change"] = gdp_data.groupby('Country Code')["GDP"].diff(periods=-1) * -1
gdp_data.head()
Out[7]:
   Country Name                 Country Code  Indicator Name     Indicator Code  Year  GDP           gdp_change
0  Aruba                        ABW           GDP (current US$)  NY.GDP.MKTP.CD  1960  NaN           NaN
1  Africa Eastern and Southern  AFE           GDP (current US$)  NY.GDP.MKTP.CD  1960  2.129059e+10  5.178878e+08
2  Afghanistan                  AFG           GDP (current US$)  NY.GDP.MKTP.CD  1960  5.377778e+08  1.111108e+07
3  Africa Western and Central   AFW           GDP (current US$)  NY.GDP.MKTP.CD  1960  1.040414e+10  7.237596e+08
4  Angola                       AGO           GDP (current US$)  NY.GDP.MKTP.CD  1960  NaN           NaN
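The diff(periods=-1) * -1 expression can be hard to read at a glance. The sketch below checks it on made-up values for two hypothetical countries and compares it with an equivalent shift-based form. It assumes rows are sorted by year within each country, as they are after the melt above:

```python
import pandas as pd

# Made-up GDP values for two hypothetical countries, sorted by year within each
toy = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "GDP": [10.0, 12.0, 9.0, 5.0, 5.5, 7.0],
})

grouped = toy.groupby("Country Code")["GDP"]

# Same expression as in the notebook: (current - next) * -1 == next - current
toy["gdp_change"] = grouped.diff(periods=-1) * -1

# Equivalent form: next year's GDP minus this year's GDP
toy["gdp_change_alt"] = grouped.shift(-1) - toy["GDP"]
print(toy)
```

The last year of each country has no following year, so its gdp_change is NaN, and the grouping keeps one country's last year from being compared with another country's first.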

Now we will import the START terrorism dataset:

In [8]:
terrorism_data = pd.read_excel("final_data/globalterrorismdb_0522dist.xlsx")
terrorism_data.head()
Out[8]:
eventid iyear imonth iday approxdate extended resolution country country_txt region ... addnotes scite1 scite2 scite3 dbsource INT_LOG INT_IDEO INT_MISC INT_ANY related
0 197000000001 1970 7 2 NaN 0 NaT 58 Dominican Republic 2 ... NaN NaN NaN NaN PGIS 0 0 0 0 NaN
1 197000000002 1970 0 0 NaN 0 NaT 130 Mexico 1 ... NaN NaN NaN NaN PGIS 0 1 1 1 NaN
2 197001000001 1970 1 0 NaN 0 NaT 160 Philippines 5 ... NaN NaN NaN NaN PGIS -9 -9 1 1 NaN
3 197001000002 1970 1 0 NaN 0 NaT 78 Greece 8 ... NaN NaN NaN NaN PGIS -9 -9 1 1 NaN
4 197001000003 1970 1 0 NaN 0 NaT 101 Japan 4 ... NaN NaN NaN NaN PGIS -9 -9 1 1 NaN

5 rows × 135 columns

Note that we have 135 columns in this dataset, the documentation for which is found in the GTD codebook.

We wish to combine this dataset with the GDP dataset, though we cannot do so as easily as we would like. We want to merge the two datasets by matching rows on country and year, but the two datasets do not always use the same name for the same country. Observe:

In [9]:
gdp_data[gdp_data["Country Name"] == "Korea, Rep."]["Country Name"].unique()
Out[9]:
array(['Korea, Rep.'], dtype=object)
In [10]:
terrorism_data[terrorism_data["country_txt"] == "South Korea"]["country_txt"].unique()
Out[10]:
array(['South Korea'], dtype=object)

Note that the GDP dataset uses the formal name of "Korea, Rep." in reference to South Korea, whereas the terrorism dataset uses simply "South Korea". These inconsistencies are annoying, but we can sidestep them by merging on standardized country codes instead. The GDP dataset already includes ISO3 codes for each entry, so we just have to add them to the terrorism dataset, which we will do using country_converter:

In [11]:
cc = coco.CountryConverter()
coco_logger = coco.logging.getLogger()
coco_logger.setLevel(logging.CRITICAL)

terrorism_data["country_code"] = cc.pandas_convert(series=terrorism_data["country_txt"], to='ISO3', not_found=np.nan)
terrorism_data.head()
Out[11]:
eventid iyear imonth iday approxdate extended resolution country country_txt region ... scite1 scite2 scite3 dbsource INT_LOG INT_IDEO INT_MISC INT_ANY related country_code
0 197000000001 1970 7 2 NaN 0 NaT 58 Dominican Republic 2 ... NaN NaN NaN PGIS 0 0 0 0 NaN DOM
1 197000000002 1970 0 0 NaN 0 NaT 130 Mexico 1 ... NaN NaN NaN PGIS 0 1 1 1 NaN MEX
2 197001000001 1970 1 0 NaN 0 NaT 160 Philippines 5 ... NaN NaN NaN PGIS -9 -9 1 1 NaN PHL
3 197001000002 1970 1 0 NaN 0 NaT 78 Greece 8 ... NaN NaN NaN PGIS -9 -9 1 1 NaN GRC
4 197001000003 1970 1 0 NaN 0 NaT 101 Japan 4 ... NaN NaN NaN PGIS -9 -9 1 1 NaN JPN

5 rows × 136 columns

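However, note that country_converter could not convert every country name into an ISO3 code. As the counts below show, the unmatched names mostly refer to states that no longer exist, such as West Germany and the Soviet Union. We cannot match these to the GDP dataset, so we will discard these data when dealing with GDP: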

In [12]:
terrorism_data[terrorism_data["country_code"].isna()]["country_txt"].value_counts()
Out[12]:
West Germany (FRG)                541
Yugoslavia                        203
Soviet Union                       78
East Germany (GDR)                 38
Serbia-Montenegro                  11
People's Republic of the Congo      4
International                       1
Name: country_txt, dtype: int64

Now we will add a column with text representations of the coded attacktype1 column, which corresponds to the type of terrorist event that occurred:

In [13]:
atk_type_keys = ['Assassination/Assassination Attempt', 
                 'Armed Assault', 
                 'Bombing/Explosion', 
                 'Hijacking', 
                 'Hostage Taking (Barricading)', 
                 'Hostage Taking (Kidnapping)', 
                 'Facility/Infrastructure Attack', 
                 'Unarmed Assault (inc. Chem., Bio., Rad. attacks)', 
                 'Unknown']

terrorism_data["atk1_txt"] = [atk_type_keys[atk-1] for atk in terrorism_data["attacktype1"]]
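One caveat with indexing a list by atk-1 is that an unexpected code of 0 would silently pick the last entry, and a code above 9 would raise an IndexError. A possible alternative, sketched here on made-up codes with only the first three labels, is a dictionary lookup through Series.map:

```python
import pandas as pd

# Code -> label lookup (first three codes only, for brevity)
atk_labels = {
    1: "Assassination/Assassination Attempt",
    2: "Armed Assault",
    3: "Bombing/Explosion",
}

codes = pd.Series([3, 1, 2, 99])  # 99 is a deliberately invalid code

# Series.map leaves unknown codes as NaN instead of raising or mislabeling
labels = codes.map(atk_labels)
print(labels)
```

With this approach, any unexpected codes show up as missing values that can be counted with isna().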

Now we can finally merge our two datasets, including only entries found in the terrorism dataset (as we do not need any GDP data outside of our bounds):

In [14]:
gdp_terrorism = terrorism_data.merge(gdp_data, how="left", left_on=["iyear","country_code"], right_on=["Year","Country Code"])
gdp_terrorism.head()
Out[14]:
eventid iyear imonth iday approxdate extended resolution country country_txt region ... related country_code atk1_txt Country Name Country Code Indicator Name Indicator Code Year GDP gdp_change
0 197000000001 1970 7 2 NaN 0 NaT 58 Dominican Republic 2 ... NaN DOM Assassination/Assassination Attempt Dominican Republic DOM GDP (current US$) NY.GDP.MKTP.CD 1970.0 1.485500e+09 1.810000e+08
1 197000000002 1970 0 0 NaN 0 NaT 130 Mexico 1 ... NaN MEX Hostage Taking (Kidnapping) Mexico MEX GDP (current US$) NY.GDP.MKTP.CD 1970.0 3.552000e+10 3.680000e+09
2 197001000001 1970 1 0 NaN 0 NaT 160 Philippines 5 ... NaN PHL Assassination/Assassination Attempt Philippines PHL GDP (current US$) NY.GDP.MKTP.CD 1970.0 7.559180e+09 8.159065e+08
3 197001000002 1970 1 0 NaN 0 NaT 78 Greece 8 ... NaN GRC Bombing/Explosion Greece GRC GDP (current US$) NY.GDP.MKTP.CD 1970.0 1.313986e+10 1.451886e+09
4 197001000003 1970 1 0 NaN 0 NaT 101 Japan 4 ... NaN JPN Facility/Infrastructure Attack Japan JPN GDP (current US$) NY.GDP.MKTP.CD 1970.0 2.126092e+11 2.754262e+10

5 rows × 144 columns
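To check how many events found no GDP row, one option is the indicator parameter of merge. The sketch below uses small made-up tables, where None stands in for a country that could not be converted to an ISO3 code:

```python
import pandas as pd

events = pd.DataFrame({
    "iyear": [1970, 1970, 1971],
    "country_code": ["DOM", None, "MEX"],  # None: unconverted country
})
gdp = pd.DataFrame({
    "Year": [1970, 1971],
    "Country Code": ["DOM", "MEX"],
    "GDP": [1.0, 2.0],  # made-up values
})

merged = events.merge(
    gdp, how="left",
    left_on=["iyear", "country_code"], right_on=["Year", "Country Code"],
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)
unmatched = (merged["_merge"] == "left_only").sum()
print(unmatched, merged["Year"].dtype)
```

Unmatched rows receive NaN in every GDP column, which is also why Year appears as a float (1970.0) in the merged output above.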

Exploratory Data Analysis

In this stage of the data analysis pipeline, we plot different aspects of our data in order to better understand the data and discover potential trends which might exist within the data. We can map data, plot data over time, and perform rudimentary statistical analyses to better inform our hypothesis testing later.

We will first use geopandas to plot our START data on a world map in order to see how different types of attacks are distributed across the world. To do this, we first convert our data into a GeoDataFrame, which gives each datum a value for its geometry, in this case a point with the latitude and longitude of the event:

In [15]:
gdf = geopandas.GeoDataFrame(
    gdp_terrorism, geometry=geopandas.points_from_xy(gdp_terrorism.longitude, gdp_terrorism.latitude))

gdf.head()
Out[15]:
eventid iyear imonth iday approxdate extended resolution country country_txt region ... country_code atk1_txt Country Name Country Code Indicator Name Indicator Code Year GDP gdp_change geometry
0 197000000001 1970 7 2 NaN 0 NaT 58 Dominican Republic 2 ... DOM Assassination/Assassination Attempt Dominican Republic DOM GDP (current US$) NY.GDP.MKTP.CD 1970.0 1.485500e+09 1.810000e+08 POINT (-69.95116 18.45679)
1 197000000002 1970 0 0 NaN 0 NaT 130 Mexico 1 ... MEX Hostage Taking (Kidnapping) Mexico MEX GDP (current US$) NY.GDP.MKTP.CD 1970.0 3.552000e+10 3.680000e+09 POINT (-99.08662 19.37189)
2 197001000001 1970 1 0 NaN 0 NaT 160 Philippines 5 ... PHL Assassination/Assassination Attempt Philippines PHL GDP (current US$) NY.GDP.MKTP.CD 1970.0 7.559180e+09 8.159065e+08 POINT (120.59974 15.47860)
3 197001000002 1970 1 0 NaN 0 NaT 78 Greece 8 ... GRC Bombing/Explosion Greece GRC GDP (current US$) NY.GDP.MKTP.CD 1970.0 1.313986e+10 1.451886e+09 POINT (23.76273 37.99749)
4 197001000003 1970 1 0 NaN 0 NaT 101 Japan 4 ... JPN Facility/Infrastructure Attack Japan JPN GDP (current US$) NY.GDP.MKTP.CD 1970.0 2.126092e+11 2.754262e+10 POINT (130.39636 33.58041)

5 rows × 145 columns

We can now plot our data on a world map, which geopandas provides through its built-in naturalearth_lowres dataset. We make a base plot from this and then plot each of our data points on top of it, which geopandas handles for us using the geometry column added in the previous step. We pass our atk1_txt column so that geopandas color-codes the points by attack type. We make the points semi-transparent (15% opacity) so we can visually estimate the density of large clusters of points, we set the axis limits to cover exactly the set of valid longitudes and latitudes, and we add a legend to the right of the map:

In [16]:
ax = gdf.plot("atk1_txt", 
                ax= geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres')).plot(color='whitesmoke', edgecolor='black', figsize=(50,30)),
                categorical=True, 
                legend=True, 
                cmap="Set1", 
                legend_kwds={'loc': 'center left', 
                             'bbox_to_anchor':(1,0.5), 
                             'markerscale':2, 
                             'fontsize':20}, 
                s=10, 
                alpha=0.15)

ax.set_xlim(-180, 180)
ax.set_ylim(-90, 90)

leg = ax.get_legend()

for lh in leg.legendHandles: 
    lh.set_alpha(1)

plt.show()

It is rather difficult to pick out details in this map because many of the attacks are so concentrated. For example, Northern Ireland is densely filled with bombings, explosions, and armed assaults (see The Troubles), and Nicaragua is densely filled with armed assaults (see Nicaraguan Revolution and the Contra War), but at world scale it is difficult to see how attacks are distributed within each country. Hence, we also make maps of individual regions and continents to examine terrorist activity at a finer scale.

We will start with North America, which we will plot in exactly the same way as our world map but, by changing the x-limits and y-limits of what is plotted, we "zoom in" on this region:

In [17]:
def plot_data_with_limits(xrange,yrange):
    ax = gdf.plot("atk1_txt", 
                  ax= geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres')).plot(color='whitesmoke', edgecolor='black', figsize=(50,30)),
                  categorical=True, 
                  legend=True, 
                  cmap="Set1", 
                  legend_kwds={'loc': 'center left', 
                               'bbox_to_anchor':(1,0.5), 
                               'markerscale':2, 
                               'fontsize':20}, 
                  s=10, 
                  alpha=0.15)

    ax.set_xlim(xrange)
    ax.set_ylim(yrange)

    leg = ax.get_legend()

    for lh in leg.legendHandles: 
        lh.set_alpha(1)

    plt.show()

plot_data_with_limits((-170, -50),(0, 90))

From the above map, we can see that most armed assault attacks in Nicaragua occurred in the northwestern portion of the country, whereas many bombings were scattered across the Pacific coast.
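Setting axis limits only changes the visible window, so every point in the dataset is still drawn for each regional map. As a possible alternative, one could filter events to a bounding box before plotting. The sketch below uses made-up coordinates and a hypothetical helper, within_limits, which is not part of the original notebook:

```python
import pandas as pd

# Toy event table with made-up coordinates; the last row has no location
events = pd.DataFrame({
    "longitude": [-86.2, -74.1, 23.7, None],
    "latitude": [12.1, 4.6, 38.0, None],
})

def within_limits(df, xrange, yrange):
    """Return rows whose coordinates fall inside the given (min, max) ranges."""
    in_x = df["longitude"].between(*xrange)
    in_y = df["latitude"].between(*yrange)
    return df[in_x & in_y]

# Same limits as the North America map above
north_america = within_limits(events, (-170, -50), (0, 90))
print(north_america)
```

Rows with missing coordinates are excluded automatically, because between() evaluates to False for NaN.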

We now plot South America:

In [18]:
plot_data_with_limits((-90, -30),(-60, 15))

Note that here, we can see a pattern of bombings across western Colombia and armed assaults focused around Ayacucho, Peru.

We now plot Europe:

In [19]:
plot_data_with_limits((-20, 50),(30, 75))

Note that most armed assaults in Northern Ireland happened around Belfast, and explosions tended to happen in the southeast portion of the country. Note that many attacks were also localized in Kosovo (see Kosovo War) and eastern Ukraine (see Russo-Ukrainian War pre-2022).

We now plot Africa:

In [20]:
plot_data_with_limits((-25, 55),(-40, 40))

We see here many armed assaults in northeastern Nigeria as well as bombings across southern Somalia. Also observe that most attacks in Egypt are concentrated along the Nile River, where most Egyptian population centers are located.

We now plot the Middle East and Southwestern Asia:

In [21]:
plot_data_with_limits((25, 80),(10, 43))

Note that the West Bank area in the southern Levant is filled with armed assaults, whereas most of the Mediterranean coast of Israel and the Gaza Strip is densely filled with bombings and explosions. Observe also that much of Iraq, Pakistan, Afghanistan, and Yemen have quite dense concentrations of explosions and bombings.

We now plot central and East Asia:

In [22]:
plot_data_with_limits((50, 180),(0, 80))

Note that around Manila in the Philippines, there is a high density of assassinations and attempted assassinations, likely related to the many political assassinations speculated to have been committed by allies of Ferdinand Marcos Sr. Such a high density of assassinations is not seen anywhere else on the map.

We finally plot Australia and Oceania:

In [23]:
plot_data_with_limits((80, 180),(-50, 10))

After using maps to visualize where attacks are located, we might wish to more rigorously see where most attacks are located, which we can do by filtering our dataframe for the countries which appear the most in our data:

In [24]:
gdf[gdf["country_txt"].isin(gdf["country_txt"].value_counts().head().index)].groupby("country_txt")["atk1_txt"].count().sort_index()
Out[24]:
country_txt
Afghanistan    18920
Colombia        8915
India          13929
Iraq           27521
Pakistan       15504
Name: atk1_txt, dtype: int64
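Assuming atk1_txt has no missing values, the same table can likely be obtained more directly with value_counts alone. A sketch on a made-up Series:

```python
import pandas as pd

countries = pd.Series(
    ["Iraq", "Iraq", "Iraq", "India", "India", "Peru"], name="country_txt"
)

# Take the two most frequent values, then re-sort them alphabetically
top = countries.value_counts().head(2).sort_index()
print(top)
```

Here head(2) plays the role that head() (five rows by default) plays in the cell above.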

From now on, we will focus on the five countries above, as well as Nicaragua, the United Kingdom (which contains Northern Ireland), and the Philippines, all of which were briefly discussed above, for more country-specific analyses. Hence, we will filter our dataframe for only these countries:

In [25]:
gdf_filtered = gdf[gdf["country_txt"].isin(["Afghanistan", "Colombia", "India", "Iraq", "Pakistan", "Nicaragua", "United Kingdom", "Philippines"])]
gdf_filtered.head()
Out[25]:
eventid iyear imonth iday approxdate extended resolution country country_txt region ... country_code atk1_txt Country Name Country Code Indicator Name Indicator Code Year GDP gdp_change geometry
2 197001000001 1970 1 0 NaN 0 NaT 160 Philippines 5 ... PHL Assassination/Assassination Attempt Philippines PHL GDP (current US$) NY.GDP.MKTP.CD 1970.0 7.559180e+09 8.159065e+08 POINT (120.59974 15.47860)
26 197001210001 1970 1 21 NaN 0 NaT 160 Philippines 5 ... PHL Bombing/Explosion Philippines PHL GDP (current US$) NY.GDP.MKTP.CD 1970.0 7.559180e+09 8.159065e+08 POINT (121.05750 14.67428)
39 197001310001 1970 1 31 NaN 0 NaT 160 Philippines 5 ... PHL Unknown Philippines PHL GDP (current US$) NY.GDP.MKTP.CD 1970.0 7.559180e+09 8.159065e+08 POINT (120.33162 15.67505)
96 197003000001 1970 3 0 NaN 0 NaT 160 Philippines 5 ... PHL Bombing/Explosion Philippines PHL GDP (current US$) NY.GDP.MKTP.CD 1970.0 7.559180e+09 8.159065e+08 POINT (120.97867 14.59605)
150 197003240001 1970 3 24 NaN 0 NaT 160 Philippines 5 ... PHL Unknown Philippines PHL GDP (current US$) NY.GDP.MKTP.CD 1970.0 7.559180e+09 8.159065e+08 POINT (120.59194 15.15300)

5 rows × 145 columns

Now we might wish to see how each country's GDP changes over time:

In [26]:
fig, ax = plt.subplots()

for idx, gp in gdf_filtered[["country_txt", "Year", "GDP"]].drop_duplicates().groupby("country_txt"):
    gp.plot(x='Year', y='GDP', ax=ax, label=idx, logy=True)

# The World Bank indicator is "GDP (current US$)", i.e. nominal, not inflation-adjusted
ax.set_ylabel("Yearly GDP, current US$ (log scale)")

ax.set_title("GDP (Current US$) in Selected Countries Over Time")
    
ax.legend(loc="center left", bbox_to_anchor=(1,0.5))
    
plt.show()

Observe that Iraq's GDP drops drastically in the early 90s (possibly due to the Gulf War), and Nicaragua's GDP drops slightly in the late 80s, possibly due to the aforementioned Nicaraguan Revolution. We can further explore these trends as they relate to terrorist activity in our Hypothesis Testing phase.

Now let's plot year-over-year change in GDP over time in our selected countries:

In [27]:
fig, ax = plt.subplots()

for idx, gp in gdf_filtered[["country_txt", "Year", "gdp_change"]].drop_duplicates().groupby("country_txt"):
    gp.plot(x='Year', y='gdp_change', ax=ax, label=idx)

ax.set_ylabel("Change in Yearly GDP (next - current), current US$")

ax.set_title("Change in GDP (Current US$) Over Time")
    
ax.legend(loc="center left", bbox_to_anchor=(1,0.5))
    
plt.show()

Now we wish to count the terrorist attacks in each of our eight countries in each year to see if they line up with our theories about the GDP data:

In [28]:
filtered_attack_counts = gdf_filtered[["attacktype1","country_txt", "Year", "Country Code"]].groupby(["country_txt", "Country Code", "Year"]).agg({'count'}).unstack().fillna(0)
filtered_attack_counts.columns = filtered_attack_counts.columns.droplevel(0).droplevel(0)
filtered_attack_counts = filtered_attack_counts.reset_index().melt(id_vars=["country_txt", "Country Code"], var_name="Year", value_name="count")
filtered_attack_counts["Year"] = filtered_attack_counts["Year"].apply(int)
filtered_attack_counts
Out[28]:
     country_txt     Country Code  Year  count
0    Afghanistan     AFG           1970  0.0
1    Colombia        COL           1970  1.0
2    India           IND           1970  0.0
3    Iraq            IRQ           1970  0.0
4    Nicaragua       NIC           1970  1.0
...  ...             ...           ...   ...
395  Iraq            IRQ           2020  764.0
396  Nicaragua       NIC           2020  3.0
397  Pakistan        PAK           2020  294.0
398  Philippines     PHL           2020  294.0
399  United Kingdom  GBR           2020  90.0

400 rows × 4 columns
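The unstack, fillna, and melt round trip above exists to give every country a row for every year, with zero where no attacks were recorded. A more compact way to express the same idea, sketched here on a made-up event table rather than the real data, is groupby with size:

```python
import pandas as pd

events = pd.DataFrame({
    "country_txt": ["A", "A", "B", "A"],
    "Year": [2000, 2000, 2001, 2002],
})

# Count events per (country, year); unstack fills absent pairs with 0,
# and stack returns the table to one row per (country, year)
counts = (
    events.groupby(["country_txt", "Year"]).size()
          .unstack(fill_value=0)
          .stack()
          .rename("count")
          .reset_index()
)
print(counts)
```

In this toy table, country B has no events in 2000 or 2002, yet it still receives rows for those years with a count of 0.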

And we can plot count over time:

In [29]:
_, ax = plt.subplots()

for idx, gp in filtered_attack_counts.drop_duplicates().groupby("country_txt"):
    gp.plot(x='Year', y='count', ax=ax, label=idx)

ax.set_ylabel("Number of Terrorist Attacks in Given Year")

ax.set_title("Number of Terrorist Attacks in Selected Countries Over Time")
    
ax.legend(loc="center left", bbox_to_anchor=(1,0.5))

plt.show()

Observe that the vast majority of terrorist attacks happening in Iraq, Afghanistan, Pakistan, India, and the Philippines have happened within about the past 15-20 years, which, specifically for Iraq, is outside of the range of its large GDP drop. We can confirm or reject any possible trends between terrorist activity and GDP in Hypothesis Testing, though.

Now let's see if there may be trends regarding which attack types happened more frequently in certain years:

In [30]:
from matplotlib.colors import rgb2hex

cmap = plt.get_cmap("Set1")

def colors_at_breaks(cmap, breaks):
    return [rgb2hex(cmap(bb)) for bb in breaks]

breaks = [n/9+1/18 for n in range(9)]

colors = colors_at_breaks(cmap, breaks)

attack_type_counts = gdf_filtered[["atk1_txt","country_txt", "Year"]].groupby(["atk1_txt", "Year"]).agg({'count'}).unstack().fillna(0)
attack_type_counts.columns = attack_type_counts.columns.droplevel(0).droplevel(0)
attack_type_counts = attack_type_counts.reset_index().melt(id_vars="atk1_txt", var_name="Year", value_name="count")
attack_type_counts["Year"] = attack_type_counts["Year"].apply(int)

_, ax = plt.subplots()


for (idx, gp), color in zip(attack_type_counts.drop_duplicates().groupby("atk1_txt"), colors):
    gp = pd.DataFrame(gp)
    gp.plot('Year', 'count', ax=ax, label=idx, c=color)
    
ax.legend(loc="center left", bbox_to_anchor=(1,0.5))

ax.set_ylabel("Number of Terrorist Attacks in Given Year")

ax.set_title("Number of Terrorist Attacks by Type Over Time")

plt.show()

Now that we better understand the data, we can move to Hypothesis Testing to more formally confirm or reject our ideas.

Hypothesis Testing

In this phase of the data science pipeline, we use various models and statistical tests to both verify trends we might have observed in our data exploration phase as well as extend those trends to predict outside of our dataset.

We will first test whether there are links between change in GDP and the number of terrorist attacks within our eight selected countries. We can explore these trends by plotting change in GDP against the number of terrorist attacks in each country and using a regression model to see if a trend exists within each country. We make each scatterplot as usual, then fit a linear regression model to each subset of the data using sklearn and plot the fitted line on the same axes as the scatterplot.

Each model produces an r$^2$ value, which represents the proportion of variance in the target variable that can be accounted for by our independent variables. Generally, a higher r$^2$ value indicates a model that fits the data better, though adding more independent variables can never lower r$^2$ on the data the model was fit to, so a higher value alone does not prove that a model is "better".
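As a worked illustration of that definition, r$^2$ can be computed by hand as one minus the ratio of unexplained to total variation. The sketch below uses NumPy on made-up data rather than sklearn on the notebook's data:

```python
import numpy as np

# Made-up data with a roughly linear relationship
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.9, 5.2, 7.1, 8.8])

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line
predicted = slope * x + intercept

ss_res = np.sum((y - predicted) ** 2)  # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

For a least-squares line with a single predictor, this value equals the square of the Pearson correlation between x and y.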

Below, we merge our selected countries data to include both attack counts and GDP, and for each of our selected countries, we make a scatter plot and fit a Scikit-Learn Linear Regression model to all non-missing datapoints. We then plot this line on the same graph and save our r$^2$ values in a dictionary where it can be referenced later by the country whose model it belongs to.

In [31]:
attack_counts_with_gdp = filtered_attack_counts.merge(gdp_data, how="left", left_on=["Year","Country Code"], right_on=["Year","Country Code"])

r_2_values = {}

for index, gp in attack_counts_with_gdp.groupby("country_txt"):
    ax = gp.plot.scatter("count", "gdp_change", figsize=(10,8))
    

    ax.set_xlabel("Number of Terrorist Attacks in Given Year")
    ax.set_ylabel("Change in Yearly GDP (next - current), current US$")

    ax.set_title("Change in GDP by Number of Terrorist Attacks in " + index)
    
    non_na_counts = []
    non_na_gdp = []
    
    for a, b in zip(gp["count"].values.reshape(-1, 1),gp["gdp_change"].values.reshape(-1, 1)):
        if not np.isnan(a) and not np.isnan(b):
            non_na_counts.append(a)
            non_na_gdp.append(b)
    
    clf = linear_model.LinearRegression()
    clf.fit(non_na_counts, non_na_gdp)
    predicted = clf.predict(non_na_counts)
    
    ax.plot(non_na_counts,predicted, c="red")
    
    r_2 = clf.score(non_na_counts,non_na_gdp)
    
    r_2_values[index] = r_2

Now we can see our r$^2$ values:

In [32]:
for key, value in r_2_values.items():
    print(f"The r^2 value for the {key} model relating change in GDP with amount of terrorist attacks is {(value):.7f}, \n\t\
    so {(value*100):.5f}% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.\n")
The r^2 value for the Afghanistan model relating change in GDP with amount of terrorist attacks is 0.0339466, 
	    so 3.39466% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.

The r^2 value for the Colombia model relating change in GDP with amount of terrorist attacks is 0.0422305, 
	    so 4.22305% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.

The r^2 value for the India model relating change in GDP with amount of terrorist attacks is 0.1908505, 
	    so 19.08505% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.

The r^2 value for the Iraq model relating change in GDP with amount of terrorist attacks is 0.0000125, 
	    so 0.00125% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.

The r^2 value for the Nicaragua model relating change in GDP with amount of terrorist attacks is 0.0454709, 
	    so 4.54709% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.

The r^2 value for the Pakistan model relating change in GDP with amount of terrorist attacks is 0.1337640, 
	    so 13.37640% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.

The r^2 value for the Philippines model relating change in GDP with amount of terrorist attacks is 0.0748030, 
	    so 7.48030% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.

The r^2 value for the United Kingdom model relating change in GDP with amount of terrorist attacks is 0.0138532, 
	    so 1.38532% of the variance in year-to-year change in GDP between data points can be attributed to the amount of terrorist attacks occurring in that year.
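An r$^2$ value by itself does not say whether a relationship is positive or negative. One way to read off the direction is the sign of the Pearson correlation, which matches the sign of the fitted slope. A sketch on made-up yearly values:

```python
import numpy as np

# Made-up yearly values: attack counts and next-year GDP change
counts = np.array([1.0, 4.0, 2.0, 8.0, 6.0])
gdp_change = np.array([5.0, 3.0, 6.0, 1.0, 2.0])

r = np.corrcoef(counts, gdp_change)[0, 1]         # Pearson correlation, in [-1, 1]
slope = np.polyfit(counts, gdp_change, deg=1)[0]  # slope of the least-squares line
print(r, slope)
```

In this toy data both values are negative, meaning more attacks go with smaller GDP changes. In the notebook, the corresponding quantity for each country is the fitted coefficient clf.coef_.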

Note that for some of these countries, there exists a positive correlation between terrorist activity and yearly change in GDP, which we would not expect. Hence, we might suspect that there is some lurking variable that affects both change in GDP and amount of terrorist activity. One such variable might be Year, so we'll repeat our above steps while also including Year in our linear models.

In [33]:
r_2_values = {}

for index, gp in attack_counts_with_gdp.groupby("country_txt"):
    
    # Keep only rows where both the attack count and the GDP change are present
    valid = gp.dropna(subset=["count", "gdp_change"])
    
    # Two predictors (attack count and year), one target (change in GDP)
    X = valid[["count", "Year"]]
    y = valid["gdp_change"]
    
    clf = linear_model.LinearRegression()
    clf.fit(X, y)
    m_count, m_year = clf.coef_
    b = clf.intercept_
    
    print(f"{index} Change in GDP = {m_count:.0f} * attack_count + {m_year:.0f} * year - {-b:.0f}")
    
    r_2 = clf.score(X, y)
    
    r_2_values[index] = r_2
Afghanistan Change in GDP = -1367563 * attack_count + 56786062 * year - 112027805758
Colombia Change in GDP = -32586933 * attack_count + 169264141 * year - 326041133226
India Change in GDP = 28019464 * attack_count + 3276383273 * year - 6483061436543
Iraq Change in GDP = -2621742 * attack_count + 259035383 * year - 511309926848
Nicaragua Change in GDP = -914302 * attack_count + 8054741 * year - 15810674834
Pakistan Change in GDP = 5335389 * attack_count + 221041197 * year - 435923784511
Philippines Change in GDP = -8748330 * attack_count + 469776536 * year - 928267738752
United Kingdom Change in GDP = -235260828 * attack_count + 109292141 * year - 132560598083
In [34]:
for key, value in r_2_values.items():
    print(f"The r^2 value for the {key} model relating change in GDP with amount of terrorist attacks and Year is {(value):.7f}, \n\t\
    so {(value*100):.5f}% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.\n")
The r^2 value for the Afghanistan model relating change in GDP with amount of terrorist attacks and Year is 0.4245499, 
	    so 42.45499% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.

The r^2 value for the Colombia model relating change in GDP with amount of terrorist attacks and Year is 0.0548931, 
	    so 5.48931% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.

The r^2 value for the India model relating change in GDP with amount of terrorist attacks and Year is 0.2562162, 
	    so 25.62162% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.

The r^2 value for the Iraq model relating change in GDP with amount of terrorist attacks and Year is 0.0061370, 
	    so 0.61370% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.

The r^2 value for the Nicaragua model relating change in GDP with amount of terrorist attacks and Year is 0.0898135, 
	    so 8.98135% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.

The r^2 value for the Pakistan model relating change in GDP with amount of terrorist attacks and Year is 0.1713876, 
	    so 17.13876% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.

The r^2 value for the Philippines model relating change in GDP with amount of terrorist attacks and Year is 0.2696958, 
	    so 26.96958% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.

The r^2 value for the United Kingdom model relating change in GDP with amount of terrorist attacks and Year is 0.0139395, 
	    so 1.39395% of the variance in year-to-year change in GDP between data points can be attributed to either Year or the amount of terrorist attacks occurring in that year.

Our models are not perfect, but in six of the eight countries (all but India and Pakistan) the coefficient on attack count is negative: holding Year constant, a year with more terrorist attacks tends to have a smaller change in GDP. Note, however, that Year explains far more of the variance in some countries. In Afghanistan, for example, 3.39466% of the variance in year-to-year change in GDP can be attributed to the number of terrorist attacks alone, whereas 42.45499% can be attributed to Year and the number of attacks together, implying that in Afghanistan, Year is a more reliable predictor of change in GDP than the number of terrorist attacks. The gain is not uniform, though: for the United Kingdom, r^2 only moves from 1.38532% to 1.39395%. We can still improve these models; accounting for more of the lurking variables behind the number of terrorist attacks would give a more accurate model, though this would require more data.
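One caveat when comparing the two sets of models: adding a predictor can never lower r^2 on the data the model was fit to, so part of any gain is automatic. A common correction is adjusted r^2, computed as 1 - (1 - r^2)(n - 1)/(n - p - 1) for n observations and p predictors. The sketch below applies it to r^2 values reported in this section. Note that `n_years = 40` is a placeholder assumption, since the number of non-missing years per country is not printed here, and it should be replaced with the real count.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted r^2 for a linear model with n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# ASSUMPTION: placeholder sample size; substitute the real number of usable years.
n_years = 40

# (r^2 with attack count only, r^2 with attack count and Year), as reported in this section
reported = {
    "Afghanistan": (0.0339466, 0.4245499),
    "United Kingdom": (0.0138532, 0.0139395),
}

for country, (r2_count, r2_count_year) in reported.items():
    adj_1 = adjusted_r2(r2_count, n_years, p=1)
    adj_2 = adjusted_r2(r2_count_year, n_years, p=2)
    print(f"{country}: adjusted r^2 {adj_1:.4f} (count only) -> {adj_2:.4f} (count and Year)")
```

With this placeholder sample size, Afghanistan's adjusted value still rises sharply once Year is included, while the United Kingdom's falls, which suggests Year adds little information there.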

Now we will pivot to assessing whether certain types of attacks happen more or less frequently over time. We can repeat steps similar to those above, this time using Year as the independent variable and the yearly count of each category of attack as the dependent variable. Recall from above our plot of terrorist attack categories over time. We wish to see if there exists a linear relationship between Year and the yearly count of any of these attack types, which would allow us to see which types of attacks are becoming more or less common.

In [35]:
r_2_values = {}

for attack_type, gp in attack_type_counts.groupby("atk1_txt"):

    # One predictor (Year) and one response (yearly count of this attack type)
    X = gp["Year"].values.reshape(-1, 1)
    y = gp["count"].values

    clf = linear_model.LinearRegression()
    clf.fit(X, y)
    [m_year] = clf.coef_
    b = clf.intercept_

    print(f"Amount of {attack_type} Attacks = {m_year:.5f} * year - {-b:.0f}")

    # r^2 is computed on the same data the model was fit to
    r_2_values[attack_type] = clf.score(X, y)
Amount of Armed Assault Attacks = 24.76444 * year - 48960
Amount of Assassination/Assassination Attempt Attacks = 5.79495 * year - 11353
Amount of Bombing/Explosion Attacks = 69.33689 * year - 137322
Amount of Facility/Infrastructure Attack Attacks = 4.62134 * year - 9144
Amount of Hijacking Attacks = 0.22425 * year - 443
Amount of Hostage Taking (Barricading) Attacks = 0.29176 * year - 574
Amount of Hostage Taking (Kidnapping) Attacks = 8.44190 * year - 16706
Amount of Unarmed Assault (inc. Chem., Bio., Rad. attacks) Attacks = 0.51827 * year - 1025
Amount of Unknown Attacks = 9.31523 * year - 18468
In [36]:
for key, value in r_2_values.items():
    print(f"The r^2 value for the {key} model relating amount of that type of terrorist attack with Year is {(value):.7f}, \n\t\
    so {(value*100):.5f}% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.\n")
The r^2 value for the Armed Assault model relating amount of that type of terrorist attack with Year is 0.5684934, 
	    so 56.84934% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.

The r^2 value for the Assassination/Assassination Attempt model relating amount of that type of terrorist attack with Year is 0.2994787, 
	    so 29.94787% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.

The r^2 value for the Bombing/Explosion model relating amount of that type of terrorist attack with Year is 0.5176269, 
	    so 51.76269% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.

The r^2 value for the Facility/Infrastructure Attack model relating amount of that type of terrorist attack with Year is 0.5104329, 
	    so 51.04329% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.

The r^2 value for the Hijacking model relating amount of that type of terrorist attack with Year is 0.3731653, 
	    so 37.31653% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.

The r^2 value for the Hostage Taking (Barricading) model relating amount of that type of terrorist attack with Year is 0.2058557, 
	    so 20.58557% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.

The r^2 value for the Hostage Taking (Kidnapping) model relating amount of that type of terrorist attack with Year is 0.5940840, 
	    so 59.40840% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.

The r^2 value for the Unarmed Assault (inc. Chem., Bio., Rad. attacks) model relating amount of that type of terrorist attack with Year is 0.2960495, 
	    so 29.60495% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.

The r^2 value for the Unknown model relating amount of that type of terrorist attack with Year is 0.3877592, 
	    so 38.77592% of the variance in amount of that type of terrorist attack between data points can be attributed to Year.

Note that in every category of terrorist attack, there is unfortunately a positive correlation between Year and the number of attacks in that category. For example, the models predict about 25 more Armed Assault attacks, almost 6 more Assassinations/Attempts, and nearly 70 more Bombing/Explosion attacks in each subsequent year. Note also that the r$^2$ values are fairly high for several categories; more than half (51.76%) of the year-to-year variance in the number of Bombings and Explosions can be attributed simply to the year itself. Again referring to our plot above, we can observe sharp increases in terrorist activity post-9/11, implying that terrorist activity has risen drastically in recent years, a quite sobering thought.
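To make the fitted lines concrete, the sketch below plugs a few years into two of the equations printed above. The slopes and intercepts are copied from that output; because the intercepts were printed rounded to whole numbers, the predictions are only approximate. The helper `predict_count` is our own name, introduced for illustration.

```python
# Slope and (rounded) intercept copied from the printed equations above.
fitted_lines = {
    "Armed Assault": (24.76444, -48960.0),
    "Bombing/Explosion": (69.33689, -137322.0),
}

def predict_count(attack_type: str, year: int) -> float:
    """Evaluate count = slope * year + intercept for one attack type."""
    slope, intercept = fitted_lines[attack_type]
    return slope * year + intercept

for attack_type in fitted_lines:
    for year in (1970, 2000, 2021):
        print(f"{attack_type}, {year}: {predict_count(attack_type, year):.0f} attacks predicted")
```

Evaluated at 1970, both lines give a negative number of attacks, which is impossible. This is a reminder that the straight-line fit is better read as a summary of the overall upward trend than as a year-by-year forecast.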

Conclusion ¶

Back to Table of Contents

Through our data exploration and analysis, we can observe that:

  1. Terrorist activity (and types of activity) is not uniformly distributed throughout the world, and there are some "hotspots" which have higher levels of some kinds of activity. Through the maps made in the exploratory phase, we can note multiple countries rife with terrorist activity as well as some which remain relatively peaceful.
  2. According to the regression models trained in our hypothesis testing phase, the amount of terrorist activity in a given year has a mild negative correlation with yearly change in GDP in most of our eight selected countries. For several of them, notably Afghanistan and the Philippines, Year is a much better indicator (though we assume Year stands in for lurking variables that must be addressed if we are to formally model change in GDP).
  3. According to the second batch of regression models trained in our hypothesis testing phase, every type of terrorist activity has a positive correlation with Year; that is, all types of terrorist activity are increasing over time.

Using this information, we conclude that terrorist activity has a relatively small but not insignificant impact on GDP, so if we are to continue to improve as a global community, we must reduce the number of terrorist attacks happening in the world by seeking peace when possible. This analysis provides an interesting insight into the impacts of terrorism on the world, and although country-specific GDP is not currently taking a particularly large hit as a result of increased terrorist activity, we must work towards a more peaceful world before the impact of war and terror becomes too detrimental to world economies and societies. Loss of life, liberty, and public infrastructure are painful impacts of terrorist activity, and the world will be truly better if we can overcome this plague.
